Transcribing lectures using Whisper.md

Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing.

Whisper can often end up creating very short text chunks, leading to a kind of “rapid-fire” effect.

Assuming a good-quality recording, the following settings seem to do a good job:

* Medium model. This seems to perform better than the large model when transcribing English, because the large model has no English-specific variant.
* Enable word timestamps. (The maximum words per line setting requires this.)
* Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”.
* VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
* Disable FP16 if running on Apple M1 or M2, as they don’t support it. It should be fine on everything else.

For example:

```sh
whisper --model medium --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 <input-file>
```
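If there are several recordings to process, the same settings can be wrapped in a small shell function. This is only a sketch: the function name `transcribe_lectures` and the file names in the usage note are made up for illustration, and it assumes `whisper` is on the `PATH`. The only flag added to the command above is `--output_dir`, which collects the VTT files in one place.

```sh
# Hypothetical helper: run Whisper with the settings above on each recording.
# Usage: transcribe_lectures <output-dir> <input-file>...
transcribe_lectures() {
  out_dir=$1
  shift
  mkdir -p "$out_dir"
  for recording in "$@"; do
    # One Whisper run per file, so one bad recording does not stop the rest.
    whisper --model medium --language English --output_format vtt \
      --fp16 False --word_timestamps True --max_words_per_line 30 \
      --output_dir "$out_dir" "$recording" \
      || echo "Transcription failed: $recording" >&2
  done
}
```

For example, `transcribe_lectures transcripts lecture01.mp4 lecture02.mp4` would write one `.vtt` file per recording into `transcripts/`.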

VS Code extension issues:

* Missing feature: merge subtitles. Merges selected subtitles into one and adjusts timestamps accordingly.
* Bug: Sometimes adjusting timing leads to timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic: the milliseconds overflow to `1000` instead of carrying into the seconds.
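Until the extension is fixed, a bad timestamp can be repaired by converting it to total milliseconds and formatting it again, so any overflow carries into the seconds, minutes and hours. The sketch below is not part of the extension or of Whisper. The function name `normalize_ts` is hypothetical, and it assumes the timestamp has the full `HH:MM:SS.mmm` shape, with the last field being a whole number of milliseconds.

```sh
# Hypothetical helper: carry overflowing milliseconds (or seconds, or minutes)
# in a VTT timestamp. Assumes HH:MM:SS.mmm, last field in milliseconds.
normalize_ts() {
  printf '%s\n' "$1" | awk -F'[:.]' '{
    # Collapse the four fields into a single millisecond count...
    total = (($1 * 60 + $2) * 60 + $3) * 1000 + $4
    # ...then split it back out, carrying at each step.
    ms = total % 1000; total = int(total / 1000)
    s  = total % 60;   total = int(total / 60)
    m  = total % 60;   h     = int(total / 60)
    printf "%02d:%02d:%02d.%03d\n", h, m, s, ms
  }'
}
```

With this, `normalize_ts 00:40:48.1000` prints `00:40:49.000`, and a timestamp that is already valid comes back unchanged.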
